ABSTRACT

We study the power of shared memory in models of parallel computation. We describe
a novel distributed data structure that eliminates the need for shared memory without
significantly increasing the run time of the parallel computation. More specifically, we show
how a complete network of processors can deterministically simulate one PRAM step in
$O(\log n (\log\log n)^2)$ time, when both models use $n$ processors and the size of the PRAM's
shared memory is polynomial in $n$. (The best previously known upper bound was the trivial
$O(n)$.) We also establish that this upper bound is nearly optimal: we prove that an on-line
simulation of $T$ PRAM steps by a complete network of processors requires
$\Omega(T \log n / \log\log n)$ time.

A simple consequence of the upper bound is that an Ultracomputer (the only currently
feasible general purpose parallel machine) can simulate one step of a PRAM (the most
convenient parallel model to program) in $O((\log n \log\log n)^2)$ steps.
1. INTRODUCTION

The cooperation of $n$ processors to solve a problem is useful only if the following two
goals can be achieved:

1. Efficient parallelization of the computation involved.

2. Efficient communication of partial results between processors.

Models of parallel computation that allow processors to randomly access a large shared
memory (e.g. PRAM) idealize communication and let us focus on the computation. Indeed,
they are convenient to program and most parallel algorithms in the literature use them.

Unfortunately, no realization of such models seems feasible in foreseeable technologies.
The only currently feasible model is a distributed system - a set of processors (RAMs) connected
by some communication network. As there is no shared memory, data items are stored in the
processors' local memories, and information can be exchanged between processors only by
messages. A processor can send or receive only one data item per unit time.

Let $n$ be the number of processors in the system and $m$ the number of data items. At
every logical (e.g. PRAM) step of the computation, each processor can specify one data item it
wishes to access (read or update). The execution time of the logical step is at least the number
of machine steps required to satisfy all these requests in parallel.

To illustrate the problem, assume $m \ge n^2$. A naive distribution of data items in local
memories that uses no hashing or duplication will result in some local memory having at least
$n$ data items. Then, a perverse program can in every step force all processors to access these
particular data items. This will cause an $\Omega(n)$ communication bottleneck, even if the
communication network is complete. This means that using $n$ processors may not have any
advantage over using just one, even when the computation is parallelizable!
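The bottleneck can be seen in a few lines of Python. This is only an illustrative sketch: the placement rule `naive_module` (item i goes to module i mod n) and the helper `steps_needed` are assumptions made for the example, not part of the paper.

```python
from collections import Counter

def naive_module(item, n):
    """Hypothetical naive placement: item i is stored in module i mod n."""
    return item % n

def steps_needed(requests, n):
    """A module serves one request per step, so a batch of requests needs
    at least as many steps as the busiest module receives requests."""
    load = Counter(naive_module(u, n) for u in requests)
    return max(load.values())

n = 8                                      # processors and modules
m = n * n                                  # data items, m >= n^2
adversarial = [j * n for j in range(n)]    # n distinct items, all in module 0
spread = list(range(n))                    # n distinct items, one per module

print(steps_needed(adversarial, n))        # all n requests hit one module
print(steps_needed(spread, n))             # every module gets one request
```

With the adversarial request set every processor waits on the same module, so the step costs as much as it would on a single processor.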

We therefore see that a fundamental problem is to find a scheme to organize the
data in the processors' memories such that information about any subset of $n$ data items can be
retrieved and updated in parallel as fast as possible.

This problem, called in several references 'the granularity problem of parallel memories',
is discussed in numerous papers. The survey paper by Kuck [Ku] mentions 14 of them, all
solving only part of the problem as they tailor data organization to particular families of
programs. For a general purpose parallel machine, such as the NYU-Ultracomputer (Gottlieb et
al. [GGK]), the PDDI machine (Vishkin [Vi1]), and others, one would clearly like a general
purpose organization scheme that will be the basis of an automatic (compiler-like) efficient
simulation of any program written for a shared memory model by a distributed model.

If the number of data items, $m$, is roughly the number of processors, $n$, then the fast
parallel sorting algorithms, [AKS] and [Le], solve the problem. However, we argue that in
most applications this is not the case. For example, in distributed databases, typically
thousands of processors will perform transactions on billions of data items. Also, in parallel
computation, appetite increases with eating: the more processors we can have in a parallel
computer, the larger the problems we want to solve.

In a probabilistic sense, the problem is solved even for $m > n$. Mehlhorn and Vishkin
[MV] propose distributing the data items using universal hashing. This guarantees that one
parallel request for $n$ data items will be satisfied in expected time $O(\log n / \log\log n)$.
Upfal [U] presents a randomized distributed data structure that guarantees execution of any
sequence of $T$ parallel requests in $O(T \log n)$ steps with probability tending to 1 as $n$
tends to $\infty$.

By contrast, no deterministic upper bound better than the trivial $O(n)$ is known.
Mehlhorn and Vishkin [MV], who provide an extensive study of this problem, suggest keeping
several copies of each data item. In their scheme, if all requests are for 'read' instructions, the
'easiest' copy will be read, and all requests will be satisfied in time $O(k n^{1-1/k})$, where
$m = n^k$. When update instructions are present, they cannot guarantee time better than $O(n)$,
as all copies of a data item have to be updated.
In this paper we present a data organization scheme that guarantees a worst case upper
bound of $O(\log n (\log\log n)^2)$, for any $m$ polynomial in $n$. Our scheme also keeps several
copies of each data item. The major novel idea is that not all of these copies have to be updated
- it suffices that a majority of them are. This idea allows the 'read' and 'update' operations to
be handled completely symmetrically, and still allows processors to access only the 'easiest'
majority of copies.

Our scheme is derived from the structure of a concentrator-like bipartite graph [Pi]. It is
a long standing open problem to construct such graphs explicitly. However, a random graph
from a given family will have the right properties with probability 1. As in the case of
expanders and superconcentrators (e.g. [Pi]) this is not a serious drawback, as the randomization
is done only once - when constructing the system.

One immediate application of the upper bound is to the simulation of ideal parallel
computers by feasible ones. Since a bounded degree network can simulate a complete network
in $O(\log n)$ steps ([AKS], [Le]), a typical simulation result which is derived from our upper
bound is the following: Any $n$-processor PRAM program that runs in $T$ steps can be simulated
by a bounded degree network of $n$ processors (Ultracomputer [Sc]) that runs in deterministic time
$O(T (\log n)^2 (\log\log n)^2)$ steps.

The scheme we propose has very strong fault-tolerance properties, which are very
desirable in distributed systems. It can sustain up to $O(\log n)$ maliciously chosen faults and up to
$(1-\epsilon)n$ random ones without any information or efficiency loss.

Finally we derive lower bounds for the efficiency of memory organization schemes.
We consider schemes that allow many copies of each data item, as long as each memory cell
contains one copy of one data item. The redundancy of such a scheme is the average number
of copies per data item.

Our lower bound gives a trade-off between the efficiency of a scheme and its
redundancy. If the redundancy is bounded, we get an $\Omega(n^{\epsilon})$ lower bound on the efficiency. This
result partially explains why previous attempts that considered only bounded redundancy failed
[MV], and why our scheme uses $O(\log n)$ copies per data item.

We also derive an $\Omega(\log n / \log\log n)$ unconditional lower bound on the efficiency, almost
matching our $O(\log n (\log\log n)^2)$ upper bound. This lower bound is the first result that
separates models with shared memory from the feasible models of parallel computation that
forbid it.

2. DEFINITIONS

To simplify the presentation, we shall concentrate on simulation of the weakest shared
memory model - the EREW (Exclusive-Read Exclusive-Write) PRAM, by the strongest
distributed system - a model equivalent to a complete network of processors. Extending this result
to a simulation of the strongest PRAM model (the CRCW PRAM) by a bounded degree
network of processors (an Ultracomputer) requires standard techniques, which we shall mention at
the end of section 3.

An EREW PRAM consists of $n$ processors $P_1, \ldots, P_n$ (RAMs) which operate
synchronously on a set $U$ of $m$ shared variables (or data items). In a single PRAM step, a
processor may perform some internal computation or access (read or update) one data item. Each
data item is accessed by at most one processor at each step.

An MPC (Module Parallel Computer) [MV] consists of $n$ synchronous processors,
$P_1, \ldots, P_n$, and $n$ memory modules, $M_1, \ldots, M_n$. Every module is a collection of memory
cells, each of which can store a value of one data item.

In each MPC step, a processor may perform some internal computation, or request an
access to a memory cell in one of the memory modules. From the set of processors trying to
access a specific module, exactly one will (arbitrarily) be granted permission. Only this
processor can consequently access (read or update) exactly one cell in this module.
The task of the MPC is to execute a PRAM program. This program is a sequence of
instructions $I_t$, $t = 1, \ldots, T$. Each instruction is a vector of $n$ sub-instructions, specifying the
task of each of the $n$ processors in this instruction. The sub-instruction of the processor $P_i$ can
be either to execute some local computation, or to access (read or update) a data item (shared
variable) $u_i$. In the case of an update, a new value $v_i$ is also assigned.

For the simulation, each data item $u \in U$ may have several 'physical addresses' or copies
in several memory modules of the MPC, not all of which are necessarily updated. Let $\Gamma(u)$ be
the set of modules containing a copy of $u$. We sometimes refer to $\Gamma(u)$ also as the set of
copies of $u$.

The essence of the simulation is captured by an organization scheme $S$. It consists of
an assignment of sets $\Gamma(u)$ to every $u \in U$, together with a protocol for execution of
read/update instructions (e.g. how many copies to access, in what order, etc.). Both the
assignment and the protocol may be time dependent.

A scheme is consistent if after the simulation of every PRAM instruction $I_t$, a protocol
to read data item $u$ terminates with the value assigned to $u$ by the latest previous write
instruction.

The efficiency of a given scheme $S$ is the worst case number of parallel MPC steps
required to execute one PRAM instruction (according to the protocol). Note that the worst
case is taken over all possible $n$-subsets of the set of data items $U$, and over all possible access
patterns (read/write).

Finally, we define the redundancy $r(S)$ of $S$ (at this step) to be

$$r(S) = \frac{\sum_{u \in U} |\Gamma(u)|}{m},$$

the average number of copies of a data item in the scheme at this step.
3. UPPER BOUNDS

Our main results are given below.

THEOREM 3.1: If $m$ is polynomial in $n$ then there exists a consistent scheme whose
efficiency is $O(\log n (\log\log n)^2)$.

Theorem  3.1  is  a  special  case  of: 

THEOREM 3.2: There is a constant $b_0 > 1$ s.t. for every $b \ge b_0$ and $c$ satisfying
$b^c \ge m^2$, there exists a consistent scheme with efficiency
$O(b[c (\log c)^2 + \log n \log c])$.

In our scheme, every item $u \in U$ will have exactly $2c-1$ copies, i.e., $|\Gamma(u)| = 2c-1$.
Each copy of a data item is of the form <value, time-stamp>. Before the execution of the first
instruction all the copies of each data item contain identical values and are time stamped '0'.
We will show later how to locate the copies of each data item.

The protocol for accessing data item $u$ at the $t$-th instruction is as follows:

1. To update $u$, access any $c$ copies in $\Gamma(u)$, update their values and set their
time-stamp to $t$.

2. To read $u$, access any $c$ copies in $\Gamma(u)$, and read the value of the copy with the latest
time-stamp.

This protocol completely symmetrizes the roles of read and update instructions, and
gives a new application to the majority rule used in [Th] for concurrency control of distributed
databases.
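The read/update protocol can be sketched in Python as follows. This is a minimal single-item model under stated assumptions: the class name `ReplicatedItem`, its methods, and the `which` parameter (the choice of copies to access) are hypothetical names introduced for illustration, and module contention is ignored.

```python
import random
from itertools import combinations

class ReplicatedItem:
    """One data item stored as 2c-1 copies of the form (value, time_stamp)."""

    def __init__(self, c, initial=0):
        self.c = c
        self.copies = [(initial, 0)] * (2 * c - 1)

    def _pick(self, which):
        # Any c distinct copies will do; by default pick them at random.
        if which is None:
            which = random.sample(range(2 * self.c - 1), self.c)
        idx = list(which)
        if len(set(idx)) < self.c:
            raise ValueError("must access at least c distinct copies")
        return idx

    def update(self, value, t, which=None):
        """Write (value, t) into c copies; the others stay stale."""
        for j in self._pick(which):
            self.copies[j] = (value, t)

    def read(self, which=None):
        """Access c copies and return the value with the latest time stamp."""
        accessed = [self.copies[j] for j in self._pick(which)]
        return max(accessed, key=lambda vt: vt[1])[0]


item = ReplicatedItem(c=3)               # 5 copies, a majority is 3
item.update("a", t=1, which=[0, 1, 2])
item.update("b", t=2, which=[2, 3, 4])   # copies 0 and 1 are now stale
# Every possible choice of 3 copies sees at least one copy stamped t=2.
assert all(item.read(s) == "b" for s in combinations(range(5), 3))
```

Neither operation needs to know which copies the other one touched; the time stamp alone identifies an up-to-date copy among any c that are read.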

LEMMA  3.1:  The  scheme  is  consistent. 

PROOF: We say that a copy $\gamma_j(u)$ of the data item $u$ is updated after step $t$ if it
contains the value assigned to $u$ by the latest previous write instruction.

From the fact that every two $c$-subsets of $\Gamma(u)$ have a non-empty intersection, it follows
by induction on $t$ that when the simulation of every instruction $I_t$ terminates, at least $c$ copies
of every data item $u$ are updated, these copies have the latest time stamp among all the copies
of $u$, and a read $u$ protocol would return their value. □
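The intersection fact used in the proof is a one-line counting argument, written out here for completeness. The names $S$ and $T$ for the two accessed sets are introduced only for this note.

```latex
% S, T \subseteq \Gamma(u) are any two sets of c copies, and |\Gamma(u)| = 2c - 1.
|S \cap T| \;=\; |S| + |T| - |S \cup T|
          \;\ge\; c + c - |\Gamma(u)|
          \;=\; 2c - (2c - 1) \;=\; 1 .
```

So the $c$ copies accessed by a read always include at least one of the $c$ copies written by the latest update, and that copy carries the largest time stamp.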

Let $u_i$ be the data item requested by $P_i$, $1 \le i \le n$, at this step. Recall that $c$ copies in
$\Gamma(u_i)$ have to be accessed in order to read or update $u_i$. Denote the $j$-th copy in $\Gamma(u)$ by $\gamma_j(u)$.
During the simulation of this instruction, we will say that $\gamma_j(u_i)$ is alive if this copy was not
accessed yet. Also, say that $u_i$ is alive if at least $c$ copies in $\Gamma(u_i)$ are still alive. Notice that a
request for $u_i$ is satisfied when $u_i$ is no longer alive. At this point the protocol for accessing $u_i$
can terminate.

We are ready now to describe the algorithm. We start with an informal description.

Assume that the task of $P_i$ is either to read $u_i$ or to update its value to $v_i$. Processors
will help each other to access these data items according to the protocol. It turns out to be
efficient if at most $n/(2c-1)$ data items are processed at a time. Therefore, we shall partition the
set of processors into $k = n/(2c-1)$ groups, each of size $2c-1$. There will be $2c$ phases to the
algorithm. In each of the phases, each group will work, in parallel, to satisfy the request of one
of its members. This will be done as follows: The current distinguished member, say $P_i$, will
broadcast its request (access $u_i$, and the new value $v_i$ in case of a write request) to the other
members of its group. Each of them will repeatedly try to access a fixed distinct copy of $u_i$.
After each step, the processors in this group will check whether $u_i$ is still alive, and at the first
time it is not alive (i.e. at least $c$ of its copies were accessed), this group will stop working on
$u_i$. If the request was for a read, the copy with the latest time stamp will be computed and
sent to $P_i$.

Each of the first $2c-1$ phases will have a time limit, that may stop the processing of the
$k$ data items while some are still alive. However, we will show that at most $k/(2c-1)$ of the $k$
items processed in each phase will remain alive. Hence, after $2c-1$ phases at most $k$ items
will remain. These will be distributed, using sorting, one to each group. The last phase, which
has no time limit, will handle them until all are processed.

For the formal presentation of the algorithm, let $P_{(l-1)(2c-1)+i}$, $i = 1, \ldots, 2c-1$, denote
the processors in group $l$, $l = 1, \ldots, k$, $k = n/(2c-1)$. The structure of the $j$-th copy of the data
item $u$ is, as before, <value_j(u), time_stamp_j(u)>.


Phase(i, time_limit):
begin
    for every group l, 1 <= l <= n/(2c-1), and every j, 1 <= j <= 2c-1,
    processor P_{f+j} does in parallel:
        f := (l - 1)(2c - 1);
        P_{f+i} broadcasts its request
            [read(u_{f+i}) or update(u_{f+i}, v_{f+i})]
            to P_{f+1}, ..., P_{f+2c-1};
        live(u_{f+i}) := true;
        count := 0;
        while live(u_{f+i}) and count < time_limit do
            count := count + 1;
            P_{f+j} tries to access gamma_j(u_{f+i});
            if permission granted then
                if read request then
                    read <value_j(u_{f+i}), time_stamp_j(u_{f+i})>
                else (update request)
                    <value_j(u_{f+i}), time_stamp_j(u_{f+i})> := <v_{f+i}, t>;
            if less than c copies of u_{f+i} are still alive then
                live(u_{f+i}) := false;
        end while
        if a read request then
            find and send to P_{f+i} the value with the
            latest time_stamp;
end Phase i;

The algorithm:
begin
    for i = 1 to 2c - 1 do
        run Phase(i, log_eta 4c);
    { for a fixed eta (to be calculated later),
      there are at most k live requests at this
      point of the algorithm }
    sort the k' <= k live requests and route them to
    the first processors in the k' first groups,
    one to each processor;
    run Phase(1, log_eta n);
end algorithm.
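The while loop of one phase can be simulated in Python as sketched below. This is a simplified model under stated assumptions, not the authors' implementation: the function names `random_placement` and `run_phase` are hypothetical, broadcasts and the read/update payloads are omitted, and module arbitration is modelled by letting the first request seen win.

```python
import random

def random_placement(m, n, c, rng):
    """Gamma(u): 2c-1 distinct modules per item, chosen uniformly at random."""
    return {u: rng.sample(range(n), 2 * c - 1) for u in range(m)}

def run_phase(placement, items, c, time_limit):
    """Simulate the while loop of one phase for the given requested items.

    Returns (iterations used, set of items still alive at the end).
    """
    # For each live item, the indices j of its copies not yet accessed.
    live_copies = {u: set(range(2 * c - 1)) for u in items}
    count = 0
    while live_copies and count < time_limit:
        count += 1
        granted = {}  # module -> (item, copy index) that wins this step
        for u, copies in live_copies.items():
            for j in copies:
                # Each module grants exactly one of the requests it receives.
                granted.setdefault(placement[u][j], (u, j))
        for u, j in granted.values():
            live_copies[u].discard(j)
        # An item is alive only while at least c of its copies are alive.
        live_copies = {u: s for u, s in live_copies.items() if len(s) >= c}
    return count, set(live_copies)


rng = random.Random(0)
n, c, m = 64, 3, 1000
k = n // (2 * c - 1)                       # number of groups
placement = random_placement(m, n, c, rng)
requested = rng.sample(range(m), k)        # one request per group
iterations, alive = run_phase(placement, requested, c, time_limit=10**6)
print(iterations, alive)
```

Placing every item on the same 2c-1 modules forces one item per iteration, which is the serial behaviour that a well-spread placement avoids.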


Consider now one iteration of the while loop in an execution of a phase in the
algorithm. The number of requests sent to each module during the execution of this iteration is
equal to the number of live copies of live data items this module contains. The module may
receive all the requests together and therefore process only one of them; thus we can only
guarantee that the number of copies processed in each iteration of the while loop is at least
the number of memory modules containing live copies of data items that were alive before this
iteration.

Let $A \subseteq U$ denote the set of live data items at the start of a given iteration. Let the set
$\Gamma'(u) \subseteq \Gamma(u)$ denote the set of live copies of $u \in A$ at this time. Since $u$ is alive, $|\Gamma'(u)| \ge c$.
The number of live copies at the start of this iteration is given by $\sum_{u \in A} |\Gamma'(u)|$. The number of
memory modules containing live copies of live data items, and thus a lower bound for the
number of copies processed during this iteration, is given by $|\Gamma'(A)| = |\bigcup_{u \in A} \Gamma'(u)|$.

We first show that a good organization scheme can guarantee that $|\Gamma'(A)|$ is not too
small.


LEMMA 3.2: For every $b \ge 4$, if $m \le \left(\frac{b}{(2e)^4}\right)^{c/2}$ then there is a way to distribute the
$2c-1$ copies of each of the $m$ shared data items among the $n$ modules s.t. before the start of
each iteration of the 'while' loop $|\Gamma'(A)| \ge \frac{|A|}{b}(2c-1)$.


PROOF: It is convenient to model the arrangement of the copies among the memory
modules in terms of a bipartite graph $G(U,N,E)$, where $U$ represents the set of $m$ shared data
items, $N$ the set of $n$ memory modules, and $\Gamma(u)$, the set of neighbors of a vertex $u \in U$,
represents the set of memory modules storing a copy of the data item $u$. We use a probabilistic
construction in order to prove the existence of a good memory allocation.


Let $\mathcal{G}_{m,n,c}$ be the probabilistic space of all bipartite graphs $G(U,N,E)$ s.t.
$|U| = m$, $|N| = n$ and the degree of each vertex $u \in U$ is $2c-1$. Give all graphs in the
space equal probability.

Say that a graph $G(U,N,E) \in \mathcal{G}_{m,n,c}$ is 'good' if for all possible choices of the sets
$\{\Gamma'(u) : \Gamma'(u) \subseteq \Gamma(u), |\Gamma'(u)| \ge c, u \in U\}$ and for all $A \subseteq U$, $|A| \le \frac{n}{2c-1}$, the inequality
$|\Gamma'(A)| \ge \frac{1}{b}(2c-1)|A|$ holds. This condition captures the property that for any set $A$ of
live data items, no matter which of their copies are still alive, the set of all the live copies of data
items in $A$ is distributed among at least $\frac{1}{b}(2c-1)|A|$ memory modules.


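$$\Pr\{G \text{ is not `good'}\} \le \sum_{q \le \frac{n}{2c-1}} \binom{m}{q} \binom{n}{\frac{q(2c-1)}{b}} \binom{2c-1}{c}^{q} \left(\frac{q(2c-1)}{bn}\right)^{qc} = o(1)$$

for $m \le \left(\frac{b}{(2e)^4}\right)^{c/2}$ and $b \ge 4$. □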
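For very small parameters the 'good' property from the proof can be checked by brute force, which makes the definition concrete. The sketch below is illustrative only: the function name `is_good` and the list-of-tuples representation `gamma` are assumptions, and the running time is exponential, so it is usable only on toy instances.

```python
from itertools import combinations, product

def is_good(gamma, n, c, b):
    """Brute-force test of the 'good' property on a tiny placement.

    gamma[u] is the tuple of the 2c-1 modules holding the copies of item u.
    """
    max_size = n // (2 * c - 1)
    for q in range(1, max_size + 1):
        for A in combinations(range(len(gamma)), q):
            # Live sets of size exactly c suffice: adding more live copies
            # can only enlarge the union of modules.
            for choice in product(*(combinations(gamma[u], c) for u in A)):
                modules = set().union(*choice)
                if len(modules) * b < (2 * c - 1) * q:
                    return False
    return True


# Four items, three copies each (c = 2), twelve modules, b = 4.
clustered = [(0, 1, 2)] * 4
spread = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
print(is_good(clustered, n=12, c=2, b=4))
print(is_good(spread, n=12, c=2, b=4))
```

The clustered placement fails because the live copies of several items can all sit in the same two modules, while the spread placement passes.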

(2er 


In  what  follows  we  assume  that  the  algorithm  is  applied  to  a  memory  organization  that 
possesses  the  properties  proven  in  Lemma  3.2. 

LEMMA 3.3: If the number of live items at the beginning of a phase is $w$ ($\le k$), then
after the first $s$ iterations of the while loop at most $2(1 - \frac{1}{b})^s w$ live items remain.

PROOF: At the beginning of a phase there are $w$ live items, and all their copies are
alive, so there is a total of $(2c-1)w$ live copies. By lemma 3.2, after $s$ iterations, the number
of live copies remaining is $\le (1 - \frac{1}{b})^s (2c-1)w$. Since $|\Gamma'(u)| \ge c$ for each live item, these
can be the live copies of at most $(1 - \frac{1}{b})^s \frac{2c-1}{c} w \le 2(1 - \frac{1}{b})^s w$ items. □

COROLLARY 3.2: Let $\eta = (1 - \frac{1}{b})^{-1}$.

1. After the first $\log_\eta (4c-2)$ iterations of the while loop in a phase, at most $\frac{k}{2c-1}$ items
remain alive (establishes the fact that the last phase has to process no more than $k$ requests).

2. After $\log_\eta 2k \le \log_\eta n$ iterations in a phase, no live items remain (establishes the correctness
of the last phase).
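The first part of the corollary can be checked numerically from the bound of Lemma 3.3. The sketch below is illustrative; the helper names `iterations_for_phase` and `live_fraction_bound` are assumptions made for this example.

```python
import math

def iterations_for_phase(b, c):
    """Smallest integer s with s >= log_eta(4c - 2), where eta = 1 / (1 - 1/b)."""
    eta = 1.0 / (1.0 - 1.0 / b)
    return math.ceil(math.log(4 * c - 2, eta))

def live_fraction_bound(b, s):
    """Lemma 3.3: fraction of the w items that may still be alive after s iterations."""
    return 2.0 * (1.0 - 1.0 / b) ** s

for b in (4, 8, 16):
    for c in (2, 5, 20):
        s = iterations_for_phase(b, c)
        # After s iterations at most a 1/(2c-1) fraction of the items is alive.
        print(b, c, s, live_fraction_bound(b, s) <= 1.0 / (2 * c - 1) + 1e-9)
```

The iteration count grows linearly in b and logarithmically in c, which is where the factor $b \log c$ per phase in the running time comes from.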
To complete the analysis, observe that each group needs, during each phase, to perform
the following operations: broadcast, maximum (for finding the latest time stamp) and
summation (testing whether $u_i$ is still alive). Also, before the last phase, all the requests that are still
alive are sorted.


LEMMA 3.4: Any subset of $p$ processors of the MPC, using only $p$ of the memory
modules, can perform maximum, summation, and sorting of $p$ elements, and can broadcast one
message in $O(\log p)$ steps.

PROOF: The only non-trivial case is the sorting and this can be done using Leighton's
sorting algorithm [Le]. □


THEOREM 3.2: For every $b \ge 4$, if $m \le \left(\frac{b}{(2e)^4}\right)^{c/2}$ then there exists a memory
organization scheme with efficiency $O(bc(\log c)^2 + b(\log n)(\log c))$.

PROOF: In each iteration of the while loop each processor performs up to one access
to a memory module, and each group of $2c-1$ processors computes the summation and the
maximum of up to $2c-1$ elements. Thus, each iteration takes $O(\log c)$ steps. The first $2c-1$
phases perform $O(\log_\eta c)$ iterations each, therefore together they require

$$O\left(\frac{(2c-1)(\log c)^2}{\log \eta}\right)$$

parallel steps.

The sorting before the last phase takes $O(\log n)$ steps, and the last phase consists of
$O(\log_\eta n)$ while iterations, hence requires $O((\log_\eta n)(\log c))$ steps. As
$\log \eta = \log (1 - \frac{1}{b})^{-1} = \Omega(\frac{1}{b})$, the total number of steps is
$O(bc(\log c)^2 + b(\log n)(\log c))$. □
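The step from this bound to Theorem 3.1 is a substitution, sketched here. It assumes that $b$ is fixed to a constant large enough for the hypothesis of the theorem and that $c$ is chosen proportional to $\log m$; the exponent $a$ below is a name introduced only for this note.

```latex
% Assume m \le n^{a} for a constant a, fix b = O(1), and take c = \Theta(\log m) = \Theta(\log n).
\begin{align*}
O\bigl(bc(\log c)^2 + b(\log n)(\log c)\bigr)
  &= O\bigl(\log n \,(\log\log n)^2 + \log n \,\log\log n\bigr) \\
  &= O\bigl(\log n \,(\log\log n)^2\bigr).
\end{align*}
```

With $b$ constant, $c = \Theta(\log n)$ copies per item are enough for any polynomial $m$, which matches the $O(\log n)$ redundancy quoted in the introduction.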


We mention how to extend the result of this section to a simulation of a CRCW
(concurrent read concurrent write) PRAM by an Ultracomputer. The CRCW PRAM differs from
the EREW PRAM (defined in section 2) in having no restrictions on memory access. When
several processors try to write into the same memory cell, the one with the smallest index
succeeds.

An Ultracomputer is a synchronized network of $n$ processors, connected together by a
fixed bounded degree network. At each step each processor can send and receive only one
message, through one of the lines connecting it to a direct neighbor in the network. The
network topology enables sorting of $n$ keys, initially one at each processor, in $O(\log n)$ steps.

THEOREM 3.3: Any program that requires $T$ steps on a CRCW PRAM with $n$
processors and $m$ shared variables ($m$ polynomial in $n$) can be simulated by an $n$ processor
Ultracomputer within $O(T (\log n)^2 \log\log n)$ steps.

PROOF (sketch): There are two logical parts to the simulation of each instruction.
Both parts rely on the capability of the Ultracomputer to sort $n$ items in $O(\log n)$ steps. The
first part (which involves pre- and post-processing) implements a simulation of a CRCW
PRAM instruction by the EREW PRAM model. An $O(\log n)$ algorithm for this simulation is
described in several papers (e.g. [Vi2]). The second part simulates the MPC model on the
Ultracomputer. We use the local memories of the individual processors to simulate the MPC's
memory modules. The only difficulty in this simulation is to guarantee that no processor (as a
module) receives more than one message at any step. To achieve that, the memory requests are
sorted before each execution of the 'while' loop, and only one request for each memory
module is executed. Each of the broadcast, maximum and summation computations requires
$O(\log n)$ steps on the Ultracomputer instead of the $O(\log c)$ steps it requires on the MPC.
Thus each CRCW PRAM instruction is simulated by $O((\log n)^2 \log\log n)$ Ultracomputer steps.
□

We  conclude  this  section  with  some  remarks: 

1. Fault tolerance: A variant of our scheme, in which every processor tries to access
$(2-\epsilon)c$ copies rather than $c$, guarantees that even if up to $(1-2\epsilon)c$ of the copies of each data
item are destroyed by an adversary, no information or efficiency loss will occur.

2. Explicit construction: The problem of explicit construction of a good graph (in the
sense of the proof of Lemma 3.2) remains open. This problem is intimately related to the long
standing open problem of explicit construction of $(m,n)$-concentrators (e.g. [DDPW]), when $m > n$.

4.  LOWER  BOUNDS 

The fast performance of the organization scheme presented above depends on having at
least $\Omega(\log n)$ updated copies of each data item, distributed among the modules. A natural
question to ask here is whether this redundancy in representing the data items in the memory is
essential. In this section we give a positive answer to this question. We prove a lower bound
relating the efficiency of any organization scheme to the redundancy in it. Using this trade-off
we derive a lower bound for any on-line simulation of ideal models for parallel computation
with shared memory by feasible models that forbid it.

We assume without loss of generality that each processor of the MPC has only a
constant number, $d$, of registers for internal computation. (This is no restriction, as $P_i$ can use $M_i$
as its local memory.) In what follows we consider only schemes that allow a memory cell or an
internal register to contain one value of one data item (no encoding or compression is
allowed).

THEOREM 4.1: The efficiency of any organization scheme with $m$ data items, $n$
memory modules and redundancy $r$ is $\Omega((m/n)^{1/2r})$.

PROOF: Let $S$ be a scheme with $m$ data items, $n$ modules, and redundancy $r$. If the
efficiency of the scheme $S$ is less than some number $h$ then there is no set of $n$ data items
such that all their updated copies are concentrated in a set of $h^{-1}n$ modules. Otherwise, it
would have taken at least $h$ steps to read these data items, since only one data item can be read
per step at each module.

Recall that $r$ is the average number of updated copies of a data item in the scheme.
Therefore, there are at least $\frac{m}{2}$ data items with no more than $2r$ copies. At most $dn$ out of
these items appear in the internal registers of processors.

There are $\binom{n}{h^{-1}n}$ sets of $h^{-1}n$ modules, and each set can store all the copies of no
more than $n-1$ data items. If a data item has at most $2r$ copies then all its copies are included
in at least $\binom{n-2r}{h^{-1}n-2r}$ sets of modules. Counting the total number of data items with
at most $2r$ copies that are stored by the scheme, we get

$$\binom{n}{h^{-1}n}(n-1) \ge \left(\frac{m}{2} - dn\right)\binom{n-2r}{h^{-1}n-2r},$$

which yields $h = \Omega((m/n)^{1/2r})$. □
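To get a feel for the trade-off in Theorem 4.1, the expression $(m/n)^{1/2r}$ can be evaluated for a few redundancies. This is a numerical illustration only: the function name `efficiency_lower_bound` is hypothetical, and the constant hidden by the $\Omega$ is dropped.

```python
import math

def efficiency_lower_bound(m, n, r):
    """(m/n)^(1/(2r)): the bound of Theorem 4.1 with the hidden constant dropped."""
    return (m / n) ** (1.0 / (2.0 * r))

n = 2 ** 10
m = n ** 2                       # m/n = 1024
for r in (1, 2, 5, 10):
    print(r, efficiency_lower_bound(m, n, r))
```

For bounded redundancy the bound stays polynomial in $n$; it drops to a constant only when $r$ grows like $\log(m/n)$, which is consistent with the scheme of section 3 using $O(\log n)$ copies per item.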


Using the result of theorem 4.1 we can now derive a lower bound for the on-line
simulation of a PRAM program by the MPC model.

In an on-line simulation, the MPC is required to finish executing the $t$-th PRAM
instruction before reading the $(t+1)$-st. Of course it can perform other operations as well during the
execution of the $t$-th instruction, but these cannot depend on future instructions.


We shall assume, w.l.o.g., that the initial values of all data items (and all MPC memory
cells) are zero. Since we have $m$ data items and $n$ processors, it makes sense to consider
PRAM programs of length $\Omega(\frac{m}{n})$; otherwise some items would be redundant.

THEOREM 4.2: Any on-line simulation of $T$ steps of a PRAM with $n$ processors and
$m$ shared variables on an MPC with $n$ processors and $n$ memory modules requires
$\Omega(T \frac{\log n}{\log\log n})$ parallel MPC steps.
PROOF: We will construct a PRAM program of length $T$ as follows: The first $\frac{m}{n}$
instructions will assign new values to all the data items. Subsequent instructions will alternate
between hard read and hard write instructions.

Consider the redundancy $r_t$ of the scheme after the execution of the $t$-th instruction. A
hard read instruction will essentially implement theorem 4.1 - it will assign processors to read $n$
items all of whose updated copies are condensed among a small number of modules. A
hard write instruction will assign new values to the $n$ items with the highest number of updated
copies. Clearly there are always $n$ data items with at least $r_t$ updated copies (as $m > n$).

For simplicity consider each pair of a hard read followed by a hard write as one PRAM
instruction. Let $s_t$ be the number of MPC steps used while executing the $t$-th instruction. For
the first $\tau = \frac{m}{n}$ instructions, at most $n \sum_{t=1}^{\tau} s_t$ memory locations were accessed, and hence

$$\sum_{t=1}^{\tau} s_t \ge \frac{m}{n} r_\tau. \qquad (1)$$
Recall that $r_\tau$ is the redundancy when we start alternating reads and writes. Let
$t > \tau = \frac{m}{n}$. By theorem 4.1, at least $p(r_{t-1}) = (m/n)^{1/2r_{t-1}}$ (up to a constant factor) of the
$s_t$ MPC steps were used by each processor to execute the hard read instruction. Hence, at most
$(s_t - p(r_{t-1}))n$ cells were accessed for write instructions. Also, the value of $n$ data items,
with $\ge r_{t-1}$ updated copies each, was changed, thus, we have

$$r_t \le r_{t-1} - \frac{n}{m} r_{t-1} + \frac{n}{m}\left(s_t - p(r_{t-1})\right)$$

for $t = \tau+1, \ldots, T$.

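Summing all these inequalities we get

$$\sum_{t=\tau+1}^{T} r_t \le \sum_{t=\tau+1}^{T} r_{t-1} + \frac{n}{m} \sum_{t=\tau+1}^{T} \left(s_t - p(r_{t-1}) - r_{t-1}\right).$$

Using simple manipulation we get:

$$\frac{m}{n} r_\tau + \sum_{t=\tau+1}^{T} s_t \ge \frac{m}{n} r_T + \sum_{t=\tau+1}^{T} \left(p(r_{t-1}) + r_{t-1}\right),$$

and using (1),

$$\sum_{t=1}^{T} s_t = \sum_{t=1}^{\tau} s_t + \sum_{t=\tau+1}^{T} s_t \ge \frac{m}{n} r_\tau + \sum_{t=\tau+1}^{T} s_t \ge \sum_{t=\tau}^{T-1} \left(p(r_t) + r_t\right),$$

where $\sum_{t=1}^{T} s_t$ is the total simulation time.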


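Let $\bar{r} = \frac{1}{T - \frac{m}{n}} \sum_{t=\tau}^{T-1} r_t$ be the average redundancy in the last $T - \frac{m}{n}$ steps. Notice
that $p(r) = (m/n)^{1/2r}$ is a convex function in $r$, for $r > 0$. Hence by Jensen's inequality
[RV, 211-216],

$$\sum_{t=\tau}^{T-1} p(r_t) \ge \left(T - \frac{m}{n}\right) p(\bar{r}) = \left(T - \frac{m}{n}\right) \left(\frac{m}{n}\right)^{1/2\bar{r}}.$$

Hence,

$$\sum_{t=1}^{T} s_t \ge \left(T - \frac{m}{n}\right)\left(\left(\frac{m}{n}\right)^{1/2\bar{r}} + \bar{r}\right) = \Omega\left(\left(T - \frac{m}{n}\right) \frac{\log \frac{m}{n}}{\log\log \frac{m}{n}}\right).$$

For $m \ge n^{1+\epsilon}$ and $T \ge (1+\epsilon)\frac{m}{n}$, the simulation time is $\Omega(T \frac{\log n}{\log\log n})$. □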
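The last estimate in the proof rests on a lower bound for $(m/n)^{1/2r} + r$ that holds for every redundancy $r$. One way to see it is the short case analysis below, which is a sketch under stated assumptions: the abbreviation $a$ and the threshold $r_0$ are names introduced here, logarithms are base 2, and $a \ge 4$ is assumed so that $\log\log a \ge 1$.

```latex
% Let a = m/n \ge 4 and r_0 = \frac{\log a}{2 \log\log a}.
\begin{align*}
\text{If } r \le r_0: \quad & a^{1/2r} \;\ge\; a^{\log\log a / \log a} \;=\; 2^{\log\log a} \;=\; \log a \;\ge\; r_0 . \\
\text{If } r > r_0:  \quad & a^{1/2r} + r \;>\; r_0 . \\
\text{Hence for all } r > 0: \quad & a^{1/2r} + r \;\ge\; \frac{\log a}{2\log\log a}
   \;=\; \Omega\!\left(\frac{\log \frac{m}{n}}{\log\log \frac{m}{n}}\right).
\end{align*}
```

Either the redundancy is small and the hard reads are expensive, or the redundancy is large and the hard writes are, so no choice of $\bar{r}$ escapes the bound.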


5.  CONCLUSIONS 

We describe a novel scheme for organizing data in a distributed system that admits
highly efficient retrieval and update of information in parallel.

This paper concentrates on applications to synchronized models of parallel computation,
and specifically to the question of the relative power of deterministic models with and without
shared memory. Quite surprisingly, we show that these two families of models are nearly
equivalent in power, and therefore we justify the use of shared memory models in the design of
parallel algorithms.
There are other applications of our scheme that we did not pursue in this paper. One
application is to probabilistic simulation. An interesting open problem, which we are
considering, is whether our scheme can improve the probabilistic results in [MV] or [U].

Another application we did not pursue here is to asynchronous systems. Although a
similar scheme was suggested in this context [Th], we believe that the potential of this idea was
not fully exploited there. We believe that the new notion of consistency suggested by our
scheme can have a major impact on the theory and design of such systems, in particular for
distributed database systems. We intend to continue research in this direction.